Beyond the One Step Greedy Approach in Reinforcement Learning
The famous Policy Iteration algorithm alternates between policy improvement
and policy evaluation. Implementations of this algorithm with several variants
of the latter evaluation stage, e.g., $n$-step and trace-based returns, have
been analyzed in previous works. However, the case of multiple-step lookahead
policy improvement, despite the recent increase in empirical evidence of its
strength, has, to our knowledge, not been carefully analyzed yet. In this work,
we introduce the first such analysis. Namely, we formulate variants of
multiple-step policy improvement, derive new algorithms using these definitions
and prove their convergence. Moreover, we show that recent prominent
Reinforcement Learning algorithms are, in fact, instances of our framework. We
thus shed light on their empirical success and give a recipe for deriving new
algorithms for future study.
Comment: ICML 2018
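To make the multiple-step greedy notion concrete, here is a minimal sketch of an $h$-step greedy operator on a tabular MDP: it runs $h$ Bellman optimality backups from the current value estimate and returns the root maximizer. The function name and tensor layout are illustrative conventions, not the paper's code.

```python
import numpy as np

def h_step_greedy(P, R, V, h, gamma):
    """h-step greedy policy improvement on a tabular MDP (illustrative sketch).

    P: (S, A, S) transition tensor, R: (S, A) reward table,
    V: (S,) current value estimate used as the terminal value,
    h: lookahead depth (h >= 1), gamma: discount factor.
    Returns the h-greedy policy (root argmax) and the lookahead values T^h V.
    """
    W = V.copy()
    for _ in range(h):
        Q = R + gamma * (P @ W)   # (S, A): one Bellman optimality backup
        W = Q.max(axis=1)
    # After h backups, the root-level Q defines the h-step greedy policy.
    return Q.argmax(axis=1), W
```

With h = 1 this reduces to the standard 1-step greedy improvement of Policy Iteration; larger h spends more lookahead computation per update in exchange for stronger improvement steps.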
Reinforcement Learning with Trajectory Feedback
The standard feedback model of reinforcement learning requires revealing the
reward of every visited state-action pair. However, in practice, it is often
the case that such frequent feedback is not available. In this work, we take a
first step towards relaxing this assumption and require a weaker form of
feedback, which we refer to as \emph{trajectory feedback}. Instead of observing
the reward obtained after every action, we assume we only receive a score that
represents the quality of the whole trajectory observed by the agent, namely,
the sum of all rewards obtained over this trajectory. We extend reinforcement
learning algorithms to this setting, based on least-squares estimation of the
unknown reward, for both the known and unknown transition model cases, and
study the performance of these algorithms by analyzing their regret. For cases
where the transition model is unknown, we offer a hybrid optimistic-Thompson
Sampling approach that results in a tractable algorithm.
Comment: AAAI 2021
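Since the trajectory score is the sum of the (unknown) per-pair rewards along the trajectory, the score is linear in the state-action visit counts, so the reward can be recovered by least squares. Below is a minimal sketch of that estimation step; the function name, feature encoding, and ridge term are my own illustrative choices, not the paper's implementation.

```python
import numpy as np

def estimate_rewards(trajectories, scores, n_states, n_actions, reg=1e-3):
    """Least-squares reward estimation from trajectory feedback (a sketch).

    Each trajectory's score is assumed to be the sum of the unknown
    per-pair rewards it visits, i.e. linear in the visit counts.
    trajectories: list of [(s, a), ...]; scores: one scalar per trajectory.
    Returns an (n_states, n_actions) estimate of the reward table.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * n_actions + a] += 1.0   # visit counts are the features
    y = np.asarray(scores, dtype=float)
    # Ridge-regularized least squares for stability when X is rank-deficient.
    r_hat = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return r_hat.reshape(n_states, n_actions)
```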
Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning
Multiple-step lookahead policies have demonstrated high empirical competence
in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model
Predictive Control. In a recent work \cite{efroni2018beyond}, multiple-step
greedy policies and their use in vanilla Policy Iteration algorithms were
proposed and analyzed. In this work, we study multiple-step greedy algorithms
in more practical setups. We begin by highlighting a counter-intuitive
difficulty, arising with soft-policy updates: even in the absence of
approximations, and contrary to the 1-step-greedy case, monotonic policy
improvement is not guaranteed unless the update stepsize is sufficiently large.
Taking particular care about this difficulty, we formulate and analyze online
and approximate algorithms that use such a multi-step greedy operator.
Comment: NIPS 2018
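The soft-policy update at the heart of this difficulty mixes the current policy with a greedy one. A minimal sketch in a tabular setting, with illustrative names; per the abstract, when the greedy target comes from a multi-step lookahead, monotonic improvement can fail for too-small alpha, while alpha = 1 recovers the hard update.

```python
import numpy as np

def soft_policy_update(pi, greedy_actions, alpha):
    """Soft-policy (mixture) update toward a greedy policy (a sketch).

    pi: (S, A) current stochastic policy; greedy_actions: (S,) actions,
    e.g. from an h-step lookahead; alpha: update stepsize in (0, 1].
    Caveat from the paper: with a multi-step greedy target, monotonic
    improvement is not guaranteed unless alpha is sufficiently large.
    """
    S, A = pi.shape
    greedy = np.zeros((S, A))
    greedy[np.arange(S), greedy_actions] = 1.0   # deterministic greedy policy
    return (1.0 - alpha) * pi + alpha * greedy
```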
Tractable Optimality in Episodic Latent MABs
We consider a multi-armed bandit problem with $M$ latent contexts, where an
agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to
accurately estimate the latent context. The resulting partial observation of the
environment makes the learning task significantly more challenging. Without any
additional structural assumptions, existing techniques to tackle partially
observed settings imply the decision maker can learn a near-optimal policy with
$O(A)^{H}$ episodes, but do not promise more. In this work, we show that learning
with {\em polynomial} samples in $A$ is possible. We achieve this by using
techniques from experiment design. Then, through a method-of-moments approach,
we design a procedure that provably learns a near-optimal policy with
$O(\mathrm{poly}(A) + \mathrm{poly}(M,H)^{\min(M,H)})$ interactions. In
practice, we show that we can formulate the moment-matching via maximum
likelihood estimation. In our experiments, this significantly outperforms the
worst-case guarantees, as well as existing practical methods.
Comment: NeurIPS 2022
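As a rough illustration of the "moment-matching via maximum likelihood" step, the sketch below fits one simple latent-context bandit model, $M$ contexts with Bernoulli arm rewards, by plain EM on logged episodes. The Bernoulli model, smoothing constants, and names are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def em_latent_mab(episodes, n_arms, M, iters=100, seed=0):
    """EM for a latent-context MAB (a sketch of MLE-based moment matching).

    Model assumption: each episode draws a hidden context m ~ w, and every
    pull of arm a in that episode returns a Bernoulli(mu[m, a]) reward.
    episodes: list of [(arm, reward), ...]. Returns (w, mu).
    """
    rng = np.random.default_rng(seed)
    w = np.full(M, 1.0 / M)                      # mixing weights
    mu = rng.uniform(0.2, 0.8, size=(M, n_arms))  # per-context arm means
    # Per-episode sufficient statistics: pulls and successes per arm.
    pulls = np.zeros((len(episodes), n_arms))
    succ = np.zeros((len(episodes), n_arms))
    for i, ep in enumerate(episodes):
        for a, r in ep:
            pulls[i, a] += 1.0
            succ[i, a] += r
    for _ in range(iters):
        # E-step: posterior over contexts from per-episode log-likelihoods.
        ll = (succ @ np.log(mu).T
              + (pulls - succ) @ np.log(1.0 - mu).T
              + np.log(w))
        ll -= ll.max(axis=1, keepdims=True)
        post = np.exp(ll)
        post /= post.sum(axis=1, keepdims=True)   # (episodes, M)
        # M-step: reweighted Bernoulli MLE per context, lightly smoothed.
        w = post.mean(axis=0)
        mu = (post.T @ succ + 1e-6) / (post.T @ pulls + 2e-6)
    return w, mu
```

EM here maximizes the mixture likelihood directly, which is the practical stand-in for exact moment matching that the abstract alludes to; the worst-case guarantees in the paper come from the method-of-moments analysis, not from this heuristic fit.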